## 'data.frame': 46120 obs. of 16 variables:
## $ ListingNumber : int 658116 909464 1074836 1023355 1023355 1081604 840820 841846 875565 1011916 ...
## $ ListingCreationDate : POSIXlt, format: "2012-10-22" "2013-09-14" ...
## $ CreditGrade : Factor w/ 9 levels "","A","AA","B",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ Term : int 36 36 60 36 36 36 36 60 60 36 ...
## $ LoanStatus : Factor w/ 12 levels "Cancelled","Chargedoff",..: 4 4 4 4 4 8 4 4 4 4 ...
## $ ClosedDate : POSIXlt, format: NA NA ...
## $ BorrowerAPR : num 0.1253 0.2461 0.1542 0.0762 0.0762 ...
## $ BorrowerRate : num 0.0974 0.2085 0.1314 0.0629 0.0629 ...
## $ LenderYield : num 0.0874 0.1985 0.1214 0.0529 0.0529 ...
## $ EstEffectiveYield : num 0.0849 0.1832 0.1157 0.0522 0.0522 ...
## $ EstLoss : num 0.0249 0.0925 0.0449 0.0099 0.0099 ...
## $ EstReturn : num 0.06 0.0907 0.0708 0.0423 0.0423 ...
## $ ProsperRatingNumeric: int 6 3 5 7 7 5 6 2 3 5 ...
## $ ProsperRatingAlpha : Factor w/ 8 levels "","A","AA","B",..: 2 6 4 3 3 4 2 7 6 4 ...
## $ ProsperScore : num 9 4 10 9 11 4 8 4 8 7 ...
## $ ListingCategory : Factor w/ 21 levels "Not Available",..: 17 3 2 8 8 2 2 7 2 2 ...
## ListingNumber ListingCreationDate CreditGrade
## Min. : 416275 Min. :2009-07-13 00:00:00 :46120
## 1st Qu.: 546346 1st Qu.:2011-12-23 00:00:00 A : 0
## Median : 839039 Median :2013-07-14 00:00:00 AA : 0
## Mean : 774125 Mean :2012-11-16 17:38:44 B : 0
## 3rd Qu.: 967620 3rd Qu.:2013-10-18 00:00:00 C : 0
## Max. :1111680 Max. :2013-12-31 00:00:00 D : 0
## (Other): 0
## Term LoanStatus
## Min. :12.00 Current :31269
## 1st Qu.:36.00 Completed :10421
## Median :36.00 Chargedoff : 2648
## Mean :42.39 Defaulted : 506
## 3rd Qu.:60.00 Past Due (1-15 days) : 464
## Max. :60.00 Past Due (31-60 days): 211
## (Other) : 601
## ClosedDate BorrowerAPR BorrowerRate
## Min. :2009-08-27 00:00:00 Min. :0.04583 Min. :0.0400
## 1st Qu.:2012-05-31 00:00:00 1st Qu.:0.17061 1st Qu.:0.1399
## Median :2013-02-11 00:00:00 Median :0.22108 Median :0.1899
## Mean :2012-12-10 23:48:32 Mean :0.22950 Mean :0.1986
## 3rd Qu.:2013-09-17 00:00:00 3rd Qu.:0.29371 3rd Qu.:0.2599
## Max. :2014-03-10 00:00:00 Max. :0.42395 Max. :0.3600
## NA's :32545
## LenderYield EstEffectiveYield EstLoss EstReturn
## Min. :0.0300 Min. :-0.1827 Min. :0.0049 Min. :-0.18270
## 1st Qu.:0.1299 1st Qu.: 0.1193 1st Qu.:0.0449 1st Qu.: 0.07713
## Median :0.1799 Median : 0.1601 Median :0.0724 Median : 0.09120
## Mean :0.1886 Mean : 0.1691 Mean :0.0816 Mean : 0.09693
## 3rd Qu.:0.2499 3rd Qu.: 0.2235 3rd Qu.:0.1120 3rd Qu.: 0.11660
## Max. :0.3400 Max. : 0.3199 Max. :0.3660 Max. : 0.28370
##
## ProsperRatingNumeric ProsperRatingAlpha ProsperScore
## Min. :1.000 C :9860 Min. : 1.000
## 1st Qu.:3.000 B :8347 1st Qu.: 4.000
## Median :4.000 D :8073 Median : 6.000
## Mean :4.024 A :7642 Mean : 5.958
## 3rd Qu.:5.000 E :5461 3rd Qu.: 8.000
## Max. :7.000 HR :3956 Max. :11.000
## (Other):2781
## ListingCategory
## Debt Consolidation:29712
## Other : 5141
## Home Improvement : 3478
## Business : 2784
## Auto : 1209
## Household Expenses: 1052
## (Other) : 2744
At first I had difficulty narrowing down the variables I wanted to look at out of the 81, since most of them would be a factor in the default rate, but then I realize since the ratings/scores are basically a composite of the other variables, they are what I should be looking at since my mission, EDA, is probably not to derive an algorithem to determine the score from more than 60 variables.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1399 0.1899 0.1986 0.2599 0.3600
The BorrowerRate, should be a direct function of the borrowers ratings, looks to be rather normalized from 5% to 30%, except for the huge spike (also the mode of the distribution) at 32%, which is not only higher than the 3rd Q 26%, but also very close to the Max 36%.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0300 0.1299 0.1799 0.1886 0.2499 0.3400
Expected, LenderYield follows similar pattern
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.18270 0.07713 0.09120 0.09693 0.11660 0.28370
EstReturn, however, has a distribution that’s very close to a normal distribution, without the huge spike at the tail.
Conventional thinking would say the distribution of ratings would match that of interest rates
ProsperRatingNumeric has a normal distribution, with no signs of the odd spike in the tail end
Neither does ProsperScore
Of the rates above 30%, what’s their PRating and PScore like?
This makes sense
This does not, however from the previous plot we can suspect BorrowerRate should be a direct function of PRating (case basis) instead of PScore (individual basis)
Let’s look at the borrower rates for the worst ratings and scores
Both low PRating and PScore reflect similar distribution with the most of the data having that huge spike at 31%-32% range
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1299 0.1760 0.1858 0.2379 0.3600
However even when PScore is 4 and above, the 32% spike still exists
So the dataset I’m looking at contains 46120 obvservations and 16 variables.
The relationship that I’ll be exploring mainly, but not exclusively, is the one between the various ratings and borrower rate, since the ratings are a result of assessment on the borrowers.
The lender’s rate and listing categories.
I narrow down the focus on the spike of data with around 32% interest rate, and further explore subsets of low PRating and low PScore.
First we can see the bulk of the BorrowerRate is between 10% to 20%, however there’s a huge spike at around 32% which is the mode, without it the distribution would have been rather normalized.
## BorrowerAPR BorrowerRate LenderYield
## BorrowerAPR 1.000 0.994 0.994
## BorrowerRate 0.994 1.000 1.000
## LenderYield 0.994 1.000 1.000
## EstEffectiveYield 0.866 0.866 0.866
## EstLoss 0.943 0.939 0.939
## EstReturn 0.766 0.787 0.787
## ProsperRatingNumeric -0.959 -0.951 -0.951
## ProsperScore -0.661 -0.644 -0.644
## EstEffectiveYield EstLoss EstReturn
## BorrowerAPR 0.866 0.943 0.766
## BorrowerRate 0.866 0.939 0.787
## LenderYield 0.866 0.939 0.787
## EstEffectiveYield 1.000 0.748 0.775
## EstLoss 0.748 1.000 0.533
## EstReturn 0.775 0.533 1.000
## ProsperRatingNumeric -0.820 -0.959 -0.621
## ProsperScore -0.618 -0.676 -0.340
## ProsperRatingNumeric ProsperScore
## BorrowerAPR -0.959 -0.661
## BorrowerRate -0.951 -0.644
## LenderYield -0.951 -0.644
## EstEffectiveYield -0.820 -0.618
## EstLoss -0.959 -0.676
## EstReturn -0.621 -0.340
## ProsperRatingNumeric 1.000 0.702
## ProsperScore 0.702 1.000
We see some highly correlated variables, especially BorrowerRate and LenderYield. Previously raised question of ProsperScore has about the lowest correlation among the other variables, typically aroun -0.6, its corr with EstimatedReturn is as low as -0.34, the lowest of all correlations.
The correlation between EstimateLoss and EstimatedReturn also have a rather low correlation of 0.53.
The following graph would appear more clearly with “Zoom” button or in Knitr
Let’s look at 2 and 3 first before we go back to the main topic 1.
That’s two VERY distinct groups, should definitely check that out in multivariate later.
and for 3
The peak(Mode) of EstLoss is not at the tail end like the BorrowerRate. However 15-17% is still strangely higher than 13-15%.
Back to 1, PScore and BorrowrRate
In boxplot, the trend looks rather neat, outside of PScore 5 which shows a break in the trend (higher median, mode and quartile), there’s a clear trend. From the fact that the range between 15% and 30% would not be outliers from PScore 2 to 8, means the spread is big.
But when looking at PRating, the range gets significantly smaller.
Dissapointedly, neither Boxplots show the odd concentration at 32%, let’s try geom_jitter.
Above we see the unnatural concentration of 32% from PScore 1-6 and it dissappears abruptly at 7. I hope multivariate can tell us more about 32%.
With PRating we can see the concentration of 32% at PRating 1 and 2.
Even though we gathered from the previous plots that the number of 32% drops off at PScore-7, from this graph we can see, that is likely because the number of PRating-2 drops of at PScore-7, this all but confirms that BorrowerRate is a direct function of PRating and less of one of PScore.
Looking at the relationship between PRating and EEY/EL/ER, we can see how from PR1 to PR7 the distribution goes from top heavy to bottom heavy, with PR3 having about equal outliers on either side. However, outliers aside, PR1 seems to behaves differently from the others, its top end looks to be capped too low and out of the trend. Let’s switch to the boxplot for another view.
Looking at PR1 here, while its outliers, median and 3rd quartile(bottom of the box) seems to follow the overall trend, its mean and 1st quartile (top of the box) doesn’t. One could argue the mean was dragged down by the big number of downside outliers. However that doesn’t explain why its top end seems to be capped at about 14% when through PR7 to PR2 it’s increasing until it falls off a cliff at PR1. Could lender be irrationaly undervaluing PR1, or was there a hardcap mechanism specifically for PR1, which would not make sense since lenders would be expecting higher return for higher risk.
To further explore bulkiness at 32%, we look at some other categorical variable, below is LoanStatus
We can see a high concentration of 32% especially the chargedoff, defaulted, and completed loans.
Almost across all the categories(that has enough data), there’s a high amount of 32%, and the first five categories also shows bulk at 35%, all these seems too artificial. I wouldn’t be surprised if a hard cap has been set at 32% for all, and some 35% exceptions.
I found it interesting the highest average rate is Household expense, and the lowest is Boat. Two ends of the luxury/essential spectrum?
Borrower/Lender Rates correlates strongly with PRating yet with PScore it would show a lot more variation.
The graph between EstEffectiveYield and EstReturn shows two VERY distinctive groups.
While BorrowerRate and LenderYield has a perfect 1 corr, that’s more of a given and less interesting. PRating has up to 0.95 corr with both BorrowerRate and EstLoss
back to looking at EstReturn and EstEffectiveYield and its plot that shows two distinct groups, one being a perfect corr 1 line. We can see how the PRating relates to each group oppositely, where on the corr 1 line, the higher the Pscore, the higher the EstReturn/EstEY. For the other group, the opposite applies, the higher the PScore, the lower the ER/EEY. One thing to not is that there doesn’t seem to be anything higher then PR4 in the corr-1 line.
Not only was that contradictory, it also doesn’t help us identify which variable distinguishes the two groups.
LoanStatus tells us a lot more. We can see all the data in the perfect corr 1 line are ones that are loans that are no longer active, either “Completed”, “Chargedoff” or “Defaulted”. While none of the active loans are in the straight line, some inactive loans are also not in the straight line. Which begs the question, by definition EstReturn was assigned to the loan when the loan is created, but this chart simply shows there must’ve been some adjustments made when a loan becomes inactive, yet not along the way, to cause the unnatural straight line. I found no explanation of the on the Prosper website but it could be something to ask specifically.
Let’s look at EstLoss, another variable in the formula of (EstEffectiveYield - EstLoss = EstReturn) From the many vertical straight lines, we can clearly see that for the inactive loans, the EstLoss is rather discrete (preset?) than continous. Here, just like the previous plot, we clear instinctions between active/inactive loans.
EstLoss seems to be a direct function of the PRating (soley)! From that we can deduce that BorrowerRate/EstEffectiveRate is where the discretionary happens.
The three plots here tells a bit of a story. BorrowerRate is the most straight forward. EEY is BorrowerRate after discounting fees and uncollected interest. EstReturn is EEY after discounting EstLoss.
From the first two plots we can see, going from BorrowerRate to EEY, the range of the data gets squeezed a little bit from the top down, the top goes from 32%+ to 26%+, while the bottom of the data (dark green) remains largely unchanged going only from 6% to 5%.
Note the biggest difference between the two plots is only seen in the inactive loans (completed, defaulted and charged-off) where we see some PR1 goes from 32% to as low as -20%.
The third plot, EstReturn, gets more interested. While the bands expectedly get even more squeezed from the top down, what’s less expected is the distribution of the colors, where the dark-brown (PR1) is now more concentrated in the middle of the band instead of on top, which goes against logic, why would PR1 have an EstReturn that’s in the middle between PR2 and PR7. Since the difference between the 2nd and 3rd plot is EstLoss, can we suspect that the EstLoss of PR1 is overestimated?
The above two plots we’ve looked at in Bivariate, and now breaking them down by LoanStatus. While we can now see that all the bunching are of the inactive loans, also their rates seem to be more arbitrary. Still we aren’t able to tell why PR1 is behaves differently from the trend, regardless of active/inactive atatus.
Since the inactive loans appears to have arbitrary EstReturn, until we find out more about the adjustment mechanism upon “completing” a loan, they’re not that helpful in terms of help predicting return. Let’s look at the data again with only active loans.
We see about 6-8 distinguished curves. Obvs the top one would indicat the best return for similar BorrowerRate. Surprising PRating is not the distinct factor of each curve, although we do see the relationship between the rating and return.
Running out of relevant variables, I factor in Term (length of term), which seems to be able to break the curves into two groups, and is a pretty good indicator of which curves the data would lie on. From that we can roughly deduce, given the same BorrowerRate, the longer the term the better the EstReturn.
It would have been best to find the variable that distinguish the six curves. While Term only gives us only two, it’s as good as I can find at the moment.
There are two very distinct groups of data, one is discrete and the other continuous. While most of the signs points to it being Active/Inactive, it’s not conclusive since half of Inactive data also follows the “continuous” pattern seen in Active. We do know none of the Active data follows the “discrete” pattern that can only be seen in Inactive data. Further efforts will have to be made in identifying which variable that’s causing the “discrete” pattern.
The variable Term (length of the loan) is more relevant than I had presumed.
Back to the previous EEY/ER chart but this time colored with Term. We can easily see all the “discrete data” is not only inactive, but all of them 36-month term. Although there are still 36-month data that doesn’t fall on the straight line, further efforts can be made on narrowing down what other variables only apply to 36-month data. Also why is there no active 12-month loans?
This chart also makes more sense now that Term is considered, we can see only 36-month terms have been rated PR1, and 12-month terms explains the bunching at the bottom. We can now attribute some of the “dicrete” patterns to Term being discrete (only three fixed terms) instead of continuous. However, we still have yet to find out why PR1 doesn’t havea higher EstReturn than PR2 even considering only 36-month terms.
To filter out possible policy changes within the perioud that might have created noise to the previous plot, we now subset only the active loans. This plot gives us some clarity to the floor of EstReturn for each PR, but also begs for questions on the seemingly artificial caps of EstReturn on 36-month PR1 and 60-month PR2. A confirmation and explanation from Prosper would be helpful since it’s not intuitive why PR3 would have the highest potential return, higher than PR2 and PR4. Or should it be a liberty of potential lenders chasing higher potential returns to give 60-month loans to PR1?
I utilized the data from the point of view of a potential lender. What would be my potential return given the different levels of risk that I could take.
Not knowing which variables to focus on besides EstReturn, caused me to spend some time on exploring variable that I assumed would be more relevant (PScore), and also missed out on more important variables until later in the analysis (Term).
In the most general sense, the common basis of finance still stands; The longer term and bigger risk, and the bigger expected reward. However there are some prominent patterns outside that curve.
While the biggest mystery, why certain data appears more discrete and arbitrary than continuous, has not been solved. We have narrowed down some specific questions and I’m certain with some queries to and answers from Prosper, such as Why is there only 36-month terms for PR1, yet not 12-month nor 60-month? Why is there no Current ongoing loans that has EstReturn = EstEffectiveYield? Why is there almost no active loans of 12-month?
We would be able to isolate those variables and be able to build a better model once those questions are answered.